Predicting NFL Play Calling Tendencies

Final Project
Data Science 2 with R (STAT 301-2)

Author

Alex Boyko

Published

March 14, 2024

Introduction

In the NFL, an offense can choose to either run or pass the ball when trying to move down the field and score. However, even when play calling is reduced to this simple binary, it is still difficult to predict what will happen on any given play.

Some trends have existed since the introduction of the current rules. Teams will continue to throw the ball on 3rd & Long and run it on 4th & Inches, but defenses know this. Part of the responsibility of calling plays is balancing two things: playing to your strengths and following these trends, and going against the grain to surprise your opponent.

These decisions are designed to be unpredictable. That raises the question: how accurately can we actually guess whether a team will run or pass? The goal of this project is to see whether predictive modeling can pick up on trends and patterns within this variability, and just how accurately it can predict play calls.

Data Overview

The data for this project comes from nflverse, an R package built around play-by-play data. Data from 2022 was used to perform an exploratory data analysis on the set of variables and to decide which may have the greatest impact on predicting play calls.

Within this data set, play_type is the target variable, a factor variable that has been filtered to be either ‘run’ or ‘pass’. The nflverse package does an incredible job of keeping data clean, so missingness is not an issue. The only transformation that needs to occur is to remove observations that are not runs or passes, such as special teams plays and kneeldowns.

During the initial EDA, the play_type variable was examined to ensure that a severe class imbalance would not cause issues later on. The plays are split 58.5% passes and 41.5% runs.
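As a rough illustration of that filtering step, the sketch below uses a small made-up data frame in place of the real play-by-play data. The play_type values other than run and pass are illustrative assumptions, not a full list of what the data set contains.

```r
library(dplyr)

# Toy stand-in for play-by-play rows; every value here is made up.
pbp <- data.frame(
  play_id   = 1:6,
  play_type = c("run", "pass", "punt", "qb_kneel", "pass", "field_goal")
)

# Keep only runs and passes, then store the target as a factor.
plays <- pbp |>
  filter(play_type %in% c("run", "pass")) |>
  mutate(play_type = factor(play_type, levels = c("pass", "run")))

table(plays$play_type)
```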

Methods

Once the data was cleaned and prepared, it was split into training and testing sets in an 80-20 proportion. Resamples were then created from the training set using the v-fold cross-validation function vfold_cv, with 10 partitions and four repeats generating 40 different folds of the data. These folds were then used to fit the model types described below.

All in all, 31,825 total plays were split into a training set of 25,459 and a testing set of 6,366. These plays come exclusively from the regular season of the 2023 NFL campaign and exclude Week 18, a time when many teams decide to rest their starters and significantly alter the way they play.
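The split and resampling scheme described above can be sketched with rsample as follows. The data here is simulated, and stratifying on play_type is an assumption on my part; the report does not say whether strata were used.

```r
library(rsample)

set.seed(301)

# Simulated stand-in for the cleaned plays (made-up values).
n <- 1000
plays <- data.frame(
  play_type = factor(sample(c("pass", "run"), n, replace = TRUE,
                            prob = c(0.585, 0.415))),
  down      = sample(1:4, n, replace = TRUE)
)

# 80-20 split into training and testing sets.
play_split <- initial_split(plays, prop = 0.8, strata = play_type)
play_train <- training(play_split)
play_test  <- testing(play_split)

# 10 folds repeated 4 times gives 40 resamples of the training set.
play_folds <- vfold_cv(play_train, v = 10, repeats = 4, strata = play_type)
nrow(play_folds)
```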

Recipes

Four recipe types were used, corresponding to different levels of information.

The Null

This recipe is pretty self-explanatory. Without any features, the most accurate prediction we can make is to see if runs or passes happen more often, and simply always choose the more-frequent one. In this case, that would mean always predicting a pass.
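A minimal sketch of that majority-class rule in base R, using a made-up vector with roughly the same class balance as the real data:

```r
# Made-up outcomes with a 58.5% / 41.5% pass-run balance.
play_type <- factor(c(rep("pass", 585), rep("run", 415)))

# The null model always predicts the most frequent class.
majority <- names(which.max(table(play_type)))

# Its accuracy is simply the share of plays in that class.
null_accuracy <- mean(play_type == majority)
```

On this toy vector the majority class is a pass, so the null accuracy equals the pass share.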

The Casual Fan

Any information used in this recipe must be visible on a TV broadcast. This includes predictors like down and distance, field position, quarter number, and score differential.

These two recipes, once fit to their respective models, are intended to serve as baselines. They are not expected to perform as well as the following recipes, but they serve an important purpose: they help determine the effectiveness of the other models and retain information relevant to potential questions of inference later on.

The Coach

This sets a baseline for what I believe to be a reasonable accuracy rate for a person predicting plays just before they happen without any external aid. The recipe attempts to account for trends that a coach would know offhand, like offensive 3rd down efficiency, goal-to-go scenarios, the last play that was run, and other important external conditions. Highly specific numbers and figures are not expected to be known.

The Computer

The data set contains many exact measurements that a person may be able to closely intuit, but not fully know. This recipe uses all the information the coach recipe has, but also includes rolling success rates throughout the game, previous play-calling patterns, pre-snap probabilities of various drive outcomes and overall win likelihood, and season trends.

Both of these recipes are still much simpler than what factors into real decision making, and they are limited both intentionally and unintentionally by the constraints of the data set. Each of these recipes comes in two versions, one where categorical variables are dummy encoded and one where they are not, so that they can be used with the following model types.
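To make the dummied versus non-dummied distinction concrete, here is a small sketch with the recipes package. The three predictors and their values are placeholders chosen for illustration, not the actual contents of the coach or computer recipes.

```r
library(recipes)

# Made-up rows; column names are illustrative assumptions.
plays <- data.frame(
  play_type    = factor(c("pass", "run", "pass", "run", "pass", "pass")),
  down         = factor(c(1, 2, 3, 1, 2, 4)),
  ydstogo      = c(10, 3, 8, 10, 6, 1),
  yardline_100 = c(75, 40, 22, 60, 35, 2)
)

# Version that leaves factors as they are.
tree_recipe <- recipe(play_type ~ ., data = plays)

# Version that dummy encodes every categorical predictor.
dummy_recipe <- recipe(play_type ~ ., data = plays) |>
  step_dummy(all_nominal_predictors())

baked_tree  <- tree_recipe |> prep() |> bake(new_data = NULL)
baked_dummy <- dummy_recipe |> prep() |> bake(new_data = NULL)

names(baked_dummy)
```

In the dummied version, the four-level down factor is replaced by three indicator columns, while the other version keeps it as a single factor column.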

Models

Aside from the null model specification, courtesy of parsnip, four other model types were used.

Logistic Regression

This model type was used only for the casual fan baseline. The specific engine is glmnet and the model was run with a 0.01 penalty value. No hyperparameters were tuned for this model.
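A model specification along these lines might look like the sketch below. The penalty of 0.01 comes from the text, but the mixture value is an assumption, since the report does not state it.

```r
library(parsnip)

# Penalized logistic regression with a fixed penalty.
# mixture = 1 (pure lasso) is an assumed value, not taken from the report.
logit_spec <- logistic_reg(penalty = 0.01, mixture = 1) |>
  set_engine("glmnet") |>
  set_mode("classification")
```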

K Nearest Neighbors

The k-nearest neighbors engine is kknn. The neighbors hyperparameter was tuned over a range of 10 to 100 across 10 levels.
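Assuming a regular grid was used, the candidate values for neighbors can be sketched with dials. The use of grid_regular is my assumption; the report only gives the range and the number of levels.

```r
library(dials)

# 10 evenly spaced candidate values between 10 and 100 neighbors.
knn_grid <- grid_regular(neighbors(range = c(10L, 100L)), levels = 10)

knn_grid$neighbors
```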

Boosted Tree

The boosted tree engine used was xgboost, and all three of min_n, mtry, and learn_rate were tuned. Five levels were used for each: the default range for min_n, a range of 1 to 8 or 1 to 15 for mtry (since the two recipes have different numbers of total variables), and a range of -2 to -0.2 for learn_rate, which dials specifies on the log10 scale.
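The sketch below builds one such grid with dials, assuming a regular grid and using the narrower of the two mtry ranges. It also shows that the learn_rate range is given in log10 units, so the grid itself holds values between 0.01 and about 0.63.

```r
library(dials)

# Five levels per parameter gives 5 x 5 x 5 candidate combinations.
# grid_regular is an assumption; the report only states ranges and levels.
boost_grid <- grid_regular(
  min_n(),                           # default dials range
  mtry(range = c(1L, 8L)),           # narrower of the two mtry ranges
  learn_rate(range = c(-2, -0.2)),   # log10 units
  levels = 5
)

nrow(boost_grid)
range(boost_grid$learn_rate)
```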

Random Forest

Finally, the random forest models were created using the ranger engine, and the min_n and mtry parameters were once again tuned, with min_n ranges of 2 to 10 and 4 to 18 and an mtry range of 4 to 20.

The results of all of these models were tabulated and compared using the accuracy metric and its standard error.
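For reference, the mean accuracy and its standard error across resamples can be computed as in this base R sketch. The fold accuracies are made up, and the formula (standard deviation divided by the square root of the number of folds) is the usual standard error of a mean, which I am assuming matches what was reported.

```r
# Made-up accuracy values, one per resample.
fold_accuracy <- c(0.70, 0.71, 0.69, 0.70)

# Point estimate: average accuracy across folds.
mean_accuracy <- mean(fold_accuracy)

# Standard error of that average.
std_err <- sd(fold_accuracy) / sqrt(length(fold_accuracy))
```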

Model Building & Selection

Here are the results from the initial model tunings and fits onto the resamples. These averages and standard errors were computed using the 40 total folds mentioned above.

wflow_id          mean       std_err
null_model        0.5849798  0.0000154
casual_fan        0.6434072  0.0011693
knn_coach         0.6624474  0.0013212
knn_computer      0.6665424  0.0014948
boosted_coach     0.6920249  0.0014953
boosted_computer  0.7023057  0.0010648
forest_coach      0.6873799  0.0014446
forest_computer   0.7065965  0.0012046

This is what these results look like in plot form, with point estimates and error bars stretching one standard error above and below it.

Overall, these results look as expected given the amount of information provided in the individual recipes and the flexibility of the model types. The casual fan baseline model performed much better than the null model, improving accuracy by nearly 6 percentage points while using only 5 predictors. The KNN models performed better still, but the extremely flexible boosted tree and random forest models, which were also tuned the most, were in a category of their own.

Similarly, the models using the computer recipe consistently outperformed those using the coach recipe, although the gap is not as large as expected. With the standard errors all being remarkably low, these differences are meaningful, but I anticipated the recipes playing a much larger role than the model specifications in the accuracy results.

One interesting artifact exists: the best boosted tree model performed better than the best random forest model with the coach recipe, but worse with the computer recipe. This is likely due to the increased flexibility of the random forest engine. With the coach recipe having significantly fewer predictors, that flexibility may have allowed the random forest to focus on unimportant noise to a greater extent.

That being said, the random forest with the computer recipe performed the best overall. This is not surprising to me, since it took by far the longest to compute at nearly 10 hours. Its best-performing set of hyperparameters will be used going forward.
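One rough way to size the coach-versus-computer gaps against their standard errors is sketched below, using the values from the results table. It treats the two estimates in each pair as independent, which is a simplification, since every model was fit on the same folds.

```r
# Mean accuracy and standard error by model type, copied from the table.
results <- data.frame(
  model       = c("knn", "boosted", "forest"),
  coach       = c(0.6624474, 0.6920249, 0.6873799),
  computer    = c(0.6665424, 0.7023057, 0.7065965),
  se_coach    = c(0.0013212, 0.0014953, 0.0014446),
  se_computer = c(0.0014948, 0.0010648, 0.0012046)
)

# Accuracy gained by moving from the coach to the computer recipe.
results$gap <- results$computer - results$coach

# Combined standard error, assuming independent estimates.
results$gap_se <- sqrt(results$se_coach^2 + results$se_computer^2)

# Gap expressed in combined standard errors.
results$gap_in_se <- results$gap / results$gap_se

results[, c("model", "gap", "gap_in_se")]
```

Under that simplification, each gap is more than two combined standard errors wide, and the random forest shows the largest gap of the three.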

Final Model Analysis

So, the workflow from the best performing random forest model using the computer recipe type, which had an mtry parameter of 8 and a min_n of 20, was used to fit the final model. These are the results of the final model specification on the testing set.

.metric   .estimate
accuracy  0.7062520
rmse      0.4223283
mae       0.3750394
rsq       0.2552106

CONFUSION MATRIX

The random forest computer model performed almost identically in accuracy to its resampled estimate, which makes sense with the initial standard error values being so low. Now that the final model is being assessed, other metrics have been included as well. In particular, the gap between the RMSE and MAE values is of note.

These numerical metrics come from the model’s computed probability of a given play being a pass. On average that probability was off by 0.375, but the root mean square error was sizably larger. Because RMSE penalizes large errors more heavily than MAE does, this gap means that on some plays the model was quite sure of itself but missed. Still, these numbers are difficult to contextualize in isolation.

I did not mention above that nflverse has these probabilities built into the dataset in a column called xpass, short for expected pass probability. I transformed these into run/pass predictions to compare accuracy rates and used the xpass variable to compute the RMSE, MAE, and RSQ values.
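A small worked example may help show how these metrics relate. It uses five made-up plays, codes a pass as 1 and a run as 0, and assumes a 0.5 cutoff for turning probabilities into class predictions and a squared correlation for RSQ; neither choice is stated in the report.

```r
# Made-up outcomes (1 = pass, 0 = run) and predicted pass probabilities.
truth  <- c(1, 1, 0, 1, 0)
p_pass <- c(0.9, 0.6, 0.4, 0.1, 0.5)

# Distance between each probability and what actually happened.
abs_error <- abs(p_pass - truth)

mae  <- mean(abs_error)             # average miss
rmse <- sqrt(mean(abs_error^2))     # weights large misses more heavily
rsq  <- cor(truth, p_pass)^2        # squared correlation (assumed definition)

# Class predictions from an assumed 0.5 cutoff.
pred_class <- ifelse(p_pass >= 0.5, 1, 0)
accuracy   <- mean(pred_class == truth)
```

The fourth play, a confident miss with an error of 0.9, is what pulls the RMSE above the MAE here.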

.metric   .estimate
accuracy  0.7012253
rmse      0.4241362
mae       0.3574407
rsq       0.2382495

Looking at the metrics, my final model looks quite similar. The built-in model had a lower accuracy, mean absolute error, and r-squared value, but a higher RMSE. So, comparatively speaking, the final model was slightly better at making predictions by overall accuracy and RMSE, but its predicted probabilities were farther from the actual outcome on average, as shown by the higher MAE.

Here are the two densities side by side, with the red line being the built-in model and the blue one being the final model. We can see this trend in the main peak at around 0.50, which is much smaller for the final model. To humanize the xpass model, some situations were sufficiently ambiguous that it decided to minimize error by guessing near the middle, while the final model attempted to key in on small details in the data, helpful or not, to make closer guesses.

So, what does this mean more specifically about the final model and its ability to make predictions in specific circumstances?

down  distance  accuracy   accuracy_diff  count
1     Long      0.6521079   0.0364603     2633
1     Medium    0.6734694   0.0306122       98
1     Short     0.5945946  -0.0810811       37
2     Long      0.7227813  -0.0128088     1093
2     Medium    0.6397516   0.0198758      805
2     Short     0.7665198   0.0264317      227
3     Long      0.8846881  -0.0510397      529
3     Medium    0.8361905  -0.0742857      525
3     Short     0.6917293  -0.0112782      266
4     Long      0.9200000  -0.0800000       25
4     Medium    0.9148936  -0.0425532       47
4     Short     0.6790123   0.0123457       81

These two figures represent the same idea, with the graph showing total correct and incorrect predictions by actual play call, down, and distance, and the table showing accuracy rates regardless of the play call. Across the board, the final model performed better than the null model. However, it excelled most in what are considered obvious situations.

Particularly when a pass is expected, like 3rd or 4th down with more than a few yards to go, accuracy rates were north of 80 or 90 percent. However, it struggled mightily on first down, especially in likely goal-to-go situations where the yardage is short.

Compared to the xpass model (the accuracy_diff column), it did worse in crucial situations like goal-to-go and third down, but did noticeably better in the most common situation: 1st and 10.
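A breakdown like the down-and-distance table above can be produced with a grouped summary. The sketch below uses made-up predictions, and the distance cutoffs (Short up to 2 yards, Medium up to 6, Long beyond that) are assumptions, since the report does not define the buckets.

```r
library(dplyr)

# Made-up predictions; column names are illustrative.
preds <- data.frame(
  down       = c(1, 1, 1, 3, 3, 3, 4, 4),
  ydstogo    = c(10, 10, 10, 8, 9, 2, 1, 1),
  play_type  = c("pass", "run", "run", "pass", "pass", "run", "run", "pass"),
  pred_class = c("pass", "pass", "run", "pass", "pass", "pass", "run", "run")
)

by_situation <- preds |>
  # Assumed bucket boundaries for yards to go.
  mutate(distance = cut(ydstogo, breaks = c(0, 2, 6, Inf),
                        labels = c("Short", "Medium", "Long"))) |>
  group_by(down, distance) |>
  summarise(
    accuracy = mean(pred_class == play_type),  # share of correct calls
    count    = n(),
    .groups  = "drop"
  )

by_situation
```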

Conclusion

Overall, this project yielded a model that produced similar, if not slightly better, results than the model built into the nflverse dataset. However, there is room to improve accuracy by utilizing more information within the dataset, further tuning and feature engineering, supplementing with additional information, and limiting the scope of the predictions.

These models were asked to create predictions across the entire league, encompassing all of the possible variation. However, individual teams and play callers adapt to their specific surroundings and adjust accordingly, and the final model cannot account for that. Narrowing the scope would add to the accuracy of any model, but it was not done in this project for the sake of generality.

For example, here is a plot showing teams by how often they passed the ball last season and their average overall efficiency.

Let’s use an average team like Houston, so as not to gain an additional advantage by simply having a higher baseline or a significant interaction with the efficiency features. Splitting only the Texans’ offensive plays into a training and testing set and allowing the final model to refit and produce a new estimate, there is a slight increase to 71% accuracy.

I believe that including more information, in the manners mentioned above, could create a model with a theoretical overall accuracy rate between 75% and 80%, with some teams having individual rates upward of 85%. This is an extreme jump, but I believe this project was limited in scope. Reaching that level would necessitate external data like PFF scores, personnel information, and defensive rankings, and would likely increase computational time significantly, but these are extremely important predictors absent from the data set. Such a model could be extremely helpful in creating more advanced “Over Expectation” variables and, if kept somewhat intuitive, could prove useful in understanding tendencies.

References

All data used in this project comes from the nflverse package. I’ve provided links to the repository and documentation to show how this data is collected and updated.